@lvhan028 commented on Oct 20, 2025

Motivation

The current transport protocol between the async_engine and the inference engine causes a performance degradation of more than 5% when logprobs are requested. The protocol transmits the entire cumulative sequence of generated tokens on every iteration, so for an n-token response roughly n(n+1)/2 token entries (plus their logprobs) cross the channel instead of n; at 2048 output tokens that is about 2.1 million entries per request, resulting in redundant data transfer and processing latency.

Modification

To eliminate this redundancy, the protocol now transmits only the newly generated tokens and their associated metadata (e.g., logprobs) in each iteration (see the sketch below).
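
Conceptually, the change looks like the following minimal Python sketch. It is illustrative only: the class and field names (`StepOutput`, `SequenceState`, `token_ids`, `logprobs`) are assumptions for exposition, not lmdeploy's actual transport code.

```python
# Minimal sketch of the delta protocol; names are illustrative, not lmdeploy's API.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StepOutput:
    """One streamed message from the inference engine to the async_engine."""
    token_ids: List[int]                                  # only tokens produced this step
    logprobs: List[Dict[int, float]] = field(default_factory=list)


class SequenceState:
    """Receiver side: accumulate per-step deltas into the full sequence."""

    def __init__(self) -> None:
        self.token_ids: List[int] = []
        self.logprobs: List[Dict[int, float]] = []

    def update(self, step: StepOutput) -> None:
        # Previously each message carried the whole cumulative sequence;
        # now it carries only the new tokens, so the receiver appends the delta.
        self.token_ids.extend(step.token_ids)
        self.logprobs.extend(step.logprobs)
```

This keeps the per-iteration payload constant instead of growing linearly with the sequence length, which is where the redundant transfer came from.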

Benchmark on H800

Serve a model with the PyTorch engine:

```shell
lmdeploy serve api_server Qwen/Qwen3-8B --backend pytorch --logprobs-mode raw_logprobs --enable-metrics
```

The /generate endpoint was benchmarked with https://gist.github.com/irexyc/add84faadbfdc229f28c7da3cf0d3ce8:

```shell
python profile_restful_api.py --backend lmdeploy --dataset-path /nvme1/shared/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name random --random-input-len 170 --random-output-len 2048 --random-range-ratio 0.9 --num-prompts 1024
```

Before:

```
============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     1024      
Benchmark duration (s):                  319.86    
Total input tokens:                      165130    
Total generated tokens:                  1992686   
Total generated tokens (retokenized):    0         
Request throughput (req/s):              3.20      
Input token throughput (tok/s):          516.26    
Output token throughput (tok/s):         6229.89   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   214800.21 
Median E2E Latency (ms):                 220168.95 
---------------Time to First Token----------------
Mean TTFT (ms):                          2856.35   
Median TTFT (ms):                        2831.78   
P99 TTFT (ms):                           4512.24   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          109.07    
Median TPOT (ms):                        112.41    
P99 TPOT (ms):                           165.81    
---------------Inter-token Latency----------------
Mean ITL (ms):                           942.11    
Median ITL (ms):                         380.80    
P99 ITL (ms):                            1191.39   
==================================================
```

After:

```
============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     1024      
Benchmark duration (s):                  305.74    
Total input tokens:                      165130    
Total generated tokens:                  1992686   
Total generated tokens (retokenized):    0         
Request throughput (req/s):              3.35      
Input token throughput (tok/s):          540.10    
Output token throughput (tok/s):         6517.59   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   204367.81 
Median E2E Latency (ms):                 209678.84 
---------------Time to First Token----------------
Mean TTFT (ms):                          2798.76   
Median TTFT (ms):                        2643.68   
P99 TTFT (ms):                           4446.10   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          103.68    
Median TPOT (ms):                        106.52    
P99 TPOT (ms):                           158.08    
---------------Inter-token Latency----------------
Mean ITL (ms):                           560.02    
Median ITL (ms):                         225.71    
P99 ITL (ms):                            766.30    
==================================================
```
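
Comparing the two runs: output token throughput rises from 6229.89 to 6517.59 tok/s, a gain of (6517.59 − 6229.89) / 6229.89 ≈ 4.6%; mean TPOT drops from 109.07 ms to 103.68 ms (≈ 4.9% lower); and mean ITL falls from 942.11 ms to 560.02 ms (≈ 41% lower). Benchmark duration shrinks from 319.86 s to 305.74 s (≈ 4.4%), consistent with recovering the 5+% degradation the cumulative protocol introduced.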

Merged as commit 4af69f2 into InternLM:main on Oct 23, 2025 (5 checks passed).